Vega-Lite Documentation Notes: 03_transform

Official documentation

Data Transformation

There are two ways to transform data:

  1. transform: view-level transforms, specified for the whole view

  2. encoding: field transforms inside encoding, specified on individual channels

  • If both are provided, the transform array runs first, followed by the inline encoding transforms, which execute in this order: bin -> timeUnit -> aggregate -> sort -> stack

  • Transforms inside the transform array execute in the order they appear. The following transforms are supported:

Aggregate

  • Aggregation is performed with aggregate and groupby

  • aggregate supports three properties: op (the aggregation operation/function), field (the column to operate on), and as (the output column name)

  • groupby specifies the grouping; it takes an array of column names
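
The semantics above can be sketched in plain Python; this is a simplified stand-in (a handful of ops, toy data), not how Vega-Lite actually implements aggregation:

```python
from collections import defaultdict
from statistics import mean

def aggregate(data, ops, groupby):
    """Sketch of the aggregate transform: ops is a list of
    {"op", "field", "as"} dicts, groupby a list of field names."""
    fns = {"count": len, "sum": sum, "mean": mean, "min": min, "max": max}
    groups = defaultdict(list)
    for d in data:
        groups[tuple(d[g] for g in groupby)].append(d)
    out = []
    for key, rows in groups.items():
        row = dict(zip(groupby, key))
        for spec in ops:
            # count operates on the objects themselves, regardless of field
            values = rows if spec["op"] == "count" else [r[spec["field"]] for r in rows]
            row[spec["as"]] = fns[spec["op"]](values)
        out.append(row)
    return out

cars = [
    {"Origin": "USA", "Horsepower": 130},
    {"Origin": "USA", "Horsepower": 165},
    {"Origin": "Japan", "Horsepower": 95},
]
summary = aggregate(cars, [{"op": "mean", "field": "Horsepower", "as": "mean_hp"}], ["Origin"])
```

The output has one row per group, with the groupby columns plus one column per `as` name.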

Operations supported by aggregate:

| Operation | Description |
| --- | --- |
| count | The total count of data objects in the group. Note: count operates directly on the input objects and returns the same value regardless of the provided field. |
| valid | The count of field values that are not null, undefined or NaN. |
| values | A list of data objects in the group. |
| missing | The count of null or undefined field values. |
| distinct | The count of distinct field values. |
| sum | The sum of field values. |
| product | The product of field values. |
| mean | The mean (average) field value. |
| average | The mean (average) field value. Identical to mean. |
| variance | The sample variance of field values. |
| variancep | The population variance of field values. |
| stdev | The sample standard deviation of field values. |
| stdevp | The population standard deviation of field values. |
| stderr | The standard error of field values. |
| median | The median field value. |
| q1 | The lower quartile boundary of field values. |
| q3 | The upper quartile boundary of field values. |
| ci0 | The lower boundary of the bootstrapped 95% confidence interval of the mean field value. |
| ci1 | The upper boundary of the bootstrapped 95% confidence interval of the mean field value. |
| min | The minimum field value. |
| max | The maximum field value. |
| argmin | An input data object containing the minimum field value. Note: when used inside encoding, argmin must be specified as an object. (See below for an example.) |
| argmax | An input data object containing the maximum field value. Note: when used inside encoding, argmax must be specified as an object. (See below for an example.) |

argmin/argmax

Finds the value of the current field that corresponds to the extreme value of another field:

{
    "data": {"url": "/assets/data/movies.json"},
    "mark": "bar",
    "encoding": {
        "x": {
        //find the Production Budget corresponding to the maximum US Gross
        "aggregate": {"argmax": "US Gross"}, 
        "field": "Production Budget",
        "type": "quantitative"
        },
        "y": {"field": "Major Genre", "type": "nominal"}
    }
}

The above uses the encoding form; the same can be done with transform:

{
    "data": {"url": "/assets/data/movies.json"},
    "transform": [{
        "aggregate": [{
            "op": "argmax",
            "field": "US Gross",
            "as": "argmax_US_Gross"
        }],
        "groupby": ["Major Genre"]
    }],
    "mark": "bar",
    "encoding": {
        "x": {
            "field": "argmax_US_Gross['Production Budget']",
            "type": "quantitative"
        },
        "y": {"field": "Major Genre", "type": "nominal"}
    }
}

Info: The encoding form seems more intuitive and needs less code — is there anything that only transform can do?
One use case for argmax: labeling a line chart by grabbing the last value on the X axis:

{
    "data": {"url": "/assets/data/stocks.csv"},
    "transform": [{"filter": "datum.symbol !== 'IBM'"}],
    "encoding": {
        "color": {
            "field":"symbol",
            "type":"nominal",
            "legend": null
        }
    },
    "layer": [{
    "mark": "line",
    "encoding": {
        "x": {"field": "date", "type": "temporal", "title": "date"},
        "y": {"field": "price", "type": "quantitative", "title": "price"}
    }
    },{
    "encoding": {
        "x": {"aggregate": "max", "field": "date", "type": "temporal"},
        "y": {"aggregate": {"argmax": "date"}, "field": "price", "type": "quantitative"}
    },
    "layer": [{
        "mark": {"type": "circle"}
    }, {
        "mark": {"type": "text", "align": "left", "dx": 4},
        "encoding": {"text": {"field":"symbol", "type": "nominal"}}
    }]
    }],
    "config": {"view": {"stroke": null}}
}

Bin

  • Can be used to make histograms

  • Works in both encoding and transform

Using bin in encoding

// A Single View or a Layer Specification
{
    ...,
    "mark/layer": ...,
    "encoding": {
        "x": {
        "bin": ..., // bin
        "field": ...,
        "type": "quantitative",
        ...
        },
        "y": ...,
        ...
    },
    ...
}

In encoding, use the bin property directly. bin accepts:

  • true: use the default binning parameters (the default is false, i.e. no binning)

  • an object of bin parameters (see the parameter table below)

  • {"binned": true}: the data is already binned; map bin-start and bin-end to x/y and x2/y2

{
    "data": {"url": "/assets/data/movies.json"},
    "mark": "bar",
    "encoding": {
        "x": {
            "bin": true,
            "field": "IMDB Rating"
        },
        "y": {"aggregate": "count"}
    }
}

Setting the binned field's type to ordinal uses the bin ranges as the axis tick labels:

{
    "data": {"url": "/assets/data/movies.json"},
    "mark": "bar",
    "encoding": {
        "x": {
            "bin": true,
            "field": "IMDB Rating",
            "type": "ordinal"
        },
        "y": {"aggregate": "count"}
    }
}

bin can also be used on a color channel for heatmap-style coloring; the legend is created automatically:

{
    "data": {"url": "/assets/data/cars.json"},
    "mark": "point",
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
        "color": {"bin": true, "field": "Acceleration"}
    }
}

Loading data that is already binned:

{
    "data": {
        "values": [
        {"bin_start": 8, "bin_end": 10, "count": 7},
        {"bin_start": 10, "bin_end": 12, "count": 29},
        {"bin_start": 12, "bin_end": 14, "count": 71},
        {"bin_start": 14, "bin_end": 16, "count": 127},
        {"bin_start": 16, "bin_end": 18, "count": 94},
        {"bin_start": 18, "bin_end": 20, "count": 54},
        {"bin_start": 20, "bin_end": 22, "count": 17},
        {"bin_start": 22, "bin_end": 24, "count": 5}
        ]
    },
    "mark": "bar",
    "encoding": {
        // note the use of binned and x2 here
        "x": {
        "field": "bin_start",
        "bin": {"binned": true, "step": 2}
        },
        "x2": {"field": "bin_end"},
        "y": {
        "field": "count",
        "type": "quantitative"
        }
    }
}

Using bin in transform

// Any View Specification
{
    ...
    "transform": [
        {"bin": ..., "field": ..., "as": ...} // Bin Transform
        ...
    ],
    ...
}

In transform, the bin transform takes three properties: bin, field, and as.

Example: using bin to generate a new column

{
  "data": {"url": "/assets/data/movies.json"},
  "transform": [
    {
      "bin": true,
      "field": "IMDB Rating",
      "as": "binned rating"
    }
  ],
  "mark": "bar",
  "encoding": {
    "x": {
      "field": "binned rating",
      "title": "IMDB Rating (binned)",
      "bin": {
        "binned": true,
        "step": 1
      }
    },
    "x2": {"field": "binned rating_end"},
    "y": {"aggregate": "count"}
  }
}

Bin parameters

| Property | Type | Description |
| --- | --- | --- |
| anchor | Number | A value in the binned domain at which to anchor the bins, shifting the bin boundaries if necessary to ensure that a boundary aligns with the anchor value. Default value: the minimum bin extent value |
| base | Number | The number base to use for automatic bin determination. Default value: 10 |
| divide | Number[] | Scale factors indicating allowable subdivisions. The default value [5, 2] indicates that for base-10 numbers, the method may consider dividing bin sizes by 5 and/or 2. For example, for an initial step size of 10, the method can check if bin sizes of 2 (= 10/5), 5 (= 10/2), or 1 (= 10/(5*2)) might also satisfy the given constraints. |
| extent | Array | A two-element ([min, max]) array indicating the range of desired bin values. |
| maxbins | Number | Maximum number of bins. Default value: 6 for row, column and shape channels; 10 for other channels |
| minstep | Number | A minimum allowable step size (particularly useful for integer values). |
| nice | Boolean | If true, attempts to make the bin boundaries use human-friendly boundaries, such as multiples of ten. Default value: true |
| step | Number | An exact step size to use between bins. Note: if provided, options such as maxbins will be ignored. |
| steps | Number[] | An array of allowable step sizes to choose from. |
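
For intuition, the fixed-step case can be mimicked in a few lines of Python; this sketch covers only an explicit step and anchor, not the automatic nice/maxbins/divide search:

```python
import math

def bin_fixed_step(value, step, anchor=0.0):
    """Assign a value to a fixed-step bin anchored at `anchor`, returning
    (bin_start, bin_end). A sketch of the `step` behavior only."""
    start = anchor + math.floor((value - anchor) / step) * step
    return start, start + step

bin_fixed_step(7.4, 1)  # -> (7.0, 8.0)
```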

Example: changing the maximum number of bins with maxbins:

{
  "data": {"url": "/assets/data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "bin": {"maxbins": 30},
      "field": "IMDB Rating"
    },
    "y": {"aggregate": "count"}
  }
}

Sorting bins

To sort the binned result, set "type": "ordinal":

{
  "data": {"url": "/assets/data/movies.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "bin": true,
      "field": "IMDB Rating",
      "type": "ordinal",
      "sort": {
        "op": "count",
        "order": "descending"
      }
    },
    "y": {"aggregate": "count"}
  }
}

Calculate

// Any View Specification
{
  ...
  "transform": [
    {"calculate": ..., "as": ...} // Calculate Transform
     ...
  ],
  ...
}

Two properties: calculate and as

calculate takes an expression over the input columns; use datum to reference the input object, e.g. 2*datum.col1 + datum.col2. Many expressions are supported — not covered here, see the Vega expression documentation.

{
  "data": {
    "values": [
      {"a": "A", "b": 28},
      {"a": "B", "b": 55},
      {"a": "C", "b": 43},
      {"a": "G", "b": 19},
      {"a": "H", "b": 87},
      {"a": "I", "b": 52},
      {"a": "D", "b": 91},
      {"a": "E", "b": 81},
      {"a": "F", "b": 53}
    ]
  },
  "transform": [
    {"calculate": "2*datum.b", "as": "b2"},
    {"filter": "datum.b2 > 60"}
  ],
  "mark": "bar",
  "encoding": {
    "y": {"field": "b2", "type": "quantitative"},
    "x": {"field": "a", "type": "ordinal"}
  }
}

Density

Computes a kernel density estimate over the given field, producing a density curve.

// Any View Specification
{
  ...
  "transform": [
    {"density": ...} // Density Transform
     ...
  ],
  ...
}

Density parameters

| Property | Type | Description |
| --- | --- | --- |
| density | String | Required. The data field for which to perform density estimation. |
| groupby | String[] | The data fields to group by. If not specified, a single group containing all data objects will be used. |
| cumulative | Boolean | Whether to produce density estimates (false) or cumulative density estimates (true). Default value: false |
| counts | Boolean | Whether the output values should be probability estimates (false) or smoothed counts (true). Default value: false |
| bandwidth | Number | The bandwidth (standard deviation) of the Gaussian kernel. If unspecified or zero, the bandwidth is automatically estimated from the input data using Scott's rule. |
| extent | Number[] | A [min, max] domain from which to sample the distribution. If unspecified, the extent is determined by the observed minimum and maximum of the density field. |
| minsteps | Number | The minimum number of samples to take along the extent domain. Default value: 25 |
| maxsteps | Number | The maximum number of samples to take along the extent domain. Default value: 200 |
| steps | Number | The exact number of samples to take along the extent domain. If specified, overrides both minsteps and maxsteps. Potentially useful with a fixed extent to ensure consistent sample points for stacked densities. |
| as | String[] | The output fields for the sample value and corresponding density estimate. Default value: ["value", "density"] |
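
For intuition, the default behavior (Gaussian kernel, uniform samples over the extent, probability-scale output) can be sketched in Python; this is illustrative, not the actual implementation:

```python
import math

def kde(values, bandwidth, extent, steps=200):
    """Sketch of the density transform's default output: a Gaussian KDE
    sampled uniformly over `extent`, as [{"value", "density"}, ...] rows."""
    lo, hi = extent
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    out = []
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        d = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        out.append({"value": x, "density": d / norm})
    return out

samples = kde([0.0], bandwidth=1.0, extent=(-3, 3), steps=6)
```

With a single data point at 0 and bandwidth 1, the estimate at x = 0 is the standard normal peak, 1/sqrt(2*pi).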

Example 1: a simple density plot

{
  "data": {
    "url": "/assets/data/movies.json"
  },
  "width": 400,
  "height": 400,
  "transform":[{
    "density": "IMDB Rating",
    "bandwidth": 0.3
  }],
  "mark": "area",
  "encoding": {
    "x": {
      "field": "value",
      "title": "IMDB Rating",
      "type": "quantitative"
    },
    "y": {
      "field": "density",
      "type": "quantitative"
    }
  }
}

Example 2: a grouped, stacked density plot — groupby to group, extent to bound the sampling range, and color by the grouping field in encoding

{
  "title": "Distribution of Body Mass of Penguins",
  "width": 400,
  "height": 300,
  "data": {
    "url": "/assets/data/penguins.json"
  },
  "mark": "area",
  "transform": [
    {
      "density": "Body Mass (g)",
      "groupby": ["Species"],
      "extent": [2500, 6500]
    }
  ],
  "encoding": {
    "x": {"field": "value", "type": "quantitative", "title": "Body Mass (g)"},
    "y": {"field": "density", "type": "quantitative", "stack": "zero"},
    "color": {"field": "Species", "type": "nominal"}
  }
}

Example 3: faceting — map the grouping field to the row channel in encoding

{
  "title": "Distribution of Body Mass of Penguins",
  "width": 400,
  "height": 80,
  "data": {
    "url": "/assets/data/penguins.json"
  },
  "mark": "area",
  "transform": [
    {
      "density": "Body Mass (g)",
      "groupby": ["Species"],
      "extent": [2500, 6500]
    }
  ],
  "encoding": {
    "x": {"field": "value", "type": "quantitative", "title": "Body Mass (g)"},
    "y": {"field": "density", "type": "quantitative", "stack": "zero"},
    "row": {"field": "Species"}
  }
}

Filter

Filters data according to the given predicate:

// Any View Specification
{
  ...
  "transform": [
    {"filter": ...} // Filter Transform
     ...
  ],
  ...
}

filter accepts a Predicate, which can be:

  • an expression, e.g. {"filter": "datum.b2 > 60"}

  • a field predicate: equal, lt, lte, gt, gte, range, oneOf, valid — see field predicates

  • a selection predicate: the name of a selection, or a logical combination of them — see selection predicate

  • a logical composition of the above with and, or, not — see logical composition

Flatten

Converts array-valued fields into individual rows, matching elements one-to-one; if the arrays differ in length, the shorter one is padded with null

//given the following table:
[
  {"key": "alpha", "foo": [1, 2], "bar": ["A", "B"]},
  {"key": "beta", "foo": [3, 4, 5], "bar": ["C", "D"]}
]

//apply flatten:
{"flatten": ["foo", "bar"]}

//the result:
[
  {"key": "alpha", "foo": 1, "bar": "A"},
  {"key": "alpha", "foo": 2, "bar": "B"},
  {"key": "beta", "foo": 3, "bar": "C"},
  {"key": "beta", "foo": 4, "bar": "D"},
  {"key": "beta", "foo": 5, "bar": null}
]
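
The padding behavior above can be reproduced in a short Python sketch (None standing in for null):

```python
def flatten(data, fields):
    """Sketch of the flatten transform: one output row per array index;
    shorter arrays are padded with None (Vega-Lite's null)."""
    out = []
    for d in data:
        length = max(len(d[f]) for f in fields)
        for i in range(length):
            row = {k: v for k, v in d.items() if k not in fields}
            for f in fields:
                row[f] = d[f][i] if i < len(d[f]) else None
            out.append(row)
    return out

table = [
    {"key": "alpha", "foo": [1, 2], "bar": ["A", "B"]},
    {"key": "beta", "foo": [3, 4, 5], "bar": ["C", "D"]},
]
rows = flatten(table, ["foo", "bar"])
```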

An advanced example (interactive; a bit complex — I haven't fully worked through it yet):

{
  "data": {
    "values": [
      { "id": "001",
        "ra": 243.35,
        "dec": "+54.6",
        "lc": [{"time": 1, "mag": 18.5}, {"time": 2, "mag": 19}]
      },
      { "id": "002",
        "ra": 210.35,
        "dec": "+14.6",
        "lc": [{"time": 1, "mag": 19.5}, {"time": 2, "mag": 20}]
      },
      { "id": "003",
        "ra": 143.35,
        "dec": "+33.6",
        "lc": [{"time": 1, "mag": 19}, {"time": 2, "mag": 18}]
      }
    ]
  },
  "transform": [
    {"flatten": ["lc"]}
  ],
  "vconcat": [
    {
      "width": 300,
      "height": 200,
      "title": "Sky position",
      "transform": [{"aggregate": [], "groupby": ["ra", "dec", "id"]}],
      "mark": "circle",
      "params": [{
        "name": "pts",
        "select": {"type": "point", "fields": ["id"]}
      }],
      "encoding": {
        "x": {"field": "ra", "type": "quantitative", "scale": {"zero": false}},
        "y": {"field": "dec", "type": "quantitative"},
        "color": {
          "condition": {"param": "pts", "value": "steelblue"},
          "value": "grey"
        },
        "size": {"value": 100}
      }
    },
    {
      "width": 300,
      "height": 200,
      "title": "Light curve",
      "transform": [{"filter": {"param": "pts"}}],
      "mark": "line",
      "encoding": {
        "x": {
          "field": "lc.time",
          "type": "quantitative",
          "scale": {"zero": false}
        },
        "y": {"field": "lc.mag", "type": "quantitative"},
        "color": {"value": "steelblue"},
        "detail": {"field": "id", "type": "nominal"}
      }
    }
  ]
}

Fold

fold: converts the given columns (one or more) into key-value pairs — reshaping wide to long

//original table
[
  {"country": "USA", "gold": 10, "silver": 20},
  {"country": "Canada", "gold": 7, "silver": 26}
]

// fold these two columns
{"fold": ["gold", "silver"]}


//new table: both columns become key-value pairs
[
  {"key": "gold", "value": 10, "country": "USA", "gold": 10, "silver": 20},
  {"key": "silver", "value": 20, "country": "USA", "gold": 10, "silver": 20},
  {"key": "gold", "value": 7, "country": "Canada", "gold": 7, "silver": 26},
  {"key": "silver", "value": 26, "country": "Canada", "gold": 7, "silver": 26}
]
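
The same reshaping in a Python sketch:

```python
def fold(data, fields, as_=("key", "value")):
    """Sketch of the fold transform: each input row produces one output row
    per folded field; the original fields are kept alongside key/value."""
    key, value = as_
    return [{key: f, value: d[f], **d} for d in data for f in fields]

medals = [
    {"country": "USA", "gold": 10, "silver": 20},
    {"country": "Canada", "gold": 7, "silver": 26},
]
folded = fold(medals, ["gold", "silver"])
```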

Impute

Impute fills in missing values. My use cases don't involve doing data prep in Vega-Lite, so I'm skipping it for now.

Todo: add notes on impute — official docs

Join Aggregate

Joins columns produced by an aggregate operation back onto the original rows

The usage mirrors aggregate:

  • joinaggregate: supports the op, field, and as properties

  • groupby
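
The difference from aggregate is that the output keeps one row per input row. A minimal Python sketch of the op=mean, no-groupby case:

```python
from statistics import mean

def joinaggregate_mean(data, field, as_field):
    """Sketch of joinaggregate with op=mean and no groupby: the aggregate is
    computed once and written back onto every input row."""
    m = mean(d[field] for d in data)
    return [{**d, as_field: m} for d in data]

movies = [{"rating": 4}, {"rating": 8}, {"rating": 9}]
rows = joinaggregate_mean(movies, "rating", "AverageRating")
```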

An example: deviation from the mean — select movies rated more than 2.5 points above the average:

{
  "data": {"url": "/assets/data/movies.json"},
  "transform": [
    {"filter": "datum['IMDB Rating'] != null"},
    {
      "joinaggregate": [{
        "op": "mean",
        "field": "IMDB Rating",
        "as": "AverageRating"
      }]
    },
    {"filter": "(datum['IMDB Rating'] - datum.AverageRating) > 2.5"}
  ],
  "layer": [
    {
      "mark": "bar",
      "encoding": {
        "x": {
          "field": "IMDB Rating", "type": "quantitative",
          "title": "IMDB Rating"
        },
        "y": {"field": "Title", "type": "ordinal"}
      }
    },
    {
      "mark": {"type": "rule", "color": "red"},
      "encoding": {
        "x": {
          "aggregate": "average",
          "field": "AverageRating",
          "type": "quantitative"
        }
      }
    }
  ]
}

Or, instead of filtering, highlight how far each movie's rating sits above or below the mean:

{
  "data": {
    "url": "/assets/data/movies.json"
  },
  "transform": [
    {"filter": "datum['IMDB Rating'] != null"},
    {"filter": {"timeUnit": "year", 
      "field": "Release Date", "range": [null, 2019]}},
    {
      "joinaggregate": [{
        "op": "mean",
        "field": "IMDB Rating",
        "as": "AverageRating"
      }]
    },
    {
      "calculate": "datum['IMDB Rating'] - datum.AverageRating",
      "as": "RatingDelta"
    }
  ],
  "mark": "point",
  "encoding": {
    "x": {
      "field": "Release Date",
      "type": "temporal"
    },
    "y": {
      "field": "RatingDelta",
      "type": "quantitative",
      "title": "Rating Delta"
    },
    "color": {
      "field": "RatingDelta",
      "type": "quantitative",
      "scale": {"domainMid": 0},
      "title": "Rating Delta"
    }
  }
}

Loess

Locally weighted regression (loess) smoothing: generates a trend line

| Property | Type | Description |
| --- | --- | --- |
| loess | String | The data field to smooth (the dependent variable) |
| on | String | The independent variable field |
| groupby | String[] | The fields to group by |
| bandwidth | Number | A value in [0, 1] controlling the degree of smoothing |
| as | String[] | The output field names |

{
  "data": {
    "url": "/assets/data/movies.json"
  },
  "layer": [
    {
      "mark": {
        "type": "point",
        "filled": true
      },
      "encoding": {
        "x": {
          "field": "Rotten Tomatoes Rating",
          "type": "quantitative"
        },
        "y": {
          "field": "IMDB Rating",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": {
        "type": "line",
        "color": "firebrick"
      },
      "transform": [
        {
          "loess": "IMDB Rating",
          "on": "Rotten Tomatoes Rating"
        }
      ],
      "encoding": {
        "x": {
          "field": "Rotten Tomatoes Rating",
          "type": "quantitative"
        },
        "y": {
          "field": "IMDB Rating",
          "type": "quantitative"
        }
      }
    }
  ]
}

Lookup

Looks up, in a secondary data source, the objects matching a given field of the primary data source.

Note: this is just the join operation familiar from everyday analysis

| Property | Type | Description |
| --- | --- | --- |
| lookup | String | The key field in the primary data |
| from | LookupData or LookupSelection | The secondary data source |
| as | String[] | The output fields to store the looked-up values in |
| default | Any | The default value assigned when a match fails; defaults to null |

Properties of the secondary source (from):

| Property | Type | Description |
| --- | --- | --- |
| data | Data | The secondary data source |
| key | String | The key field in the secondary data |
| fields | String[] | The fields to copy over; by default the whole matched object is attached |
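
A Python sketch of the join semantics, on toy data (the names here are made up for illustration):

```python
def lookup(primary, secondary, lookup_field, key, fields, default=None):
    """Sketch of the lookup transform: a left join from primary to secondary
    on lookup_field == key, copying `fields`, with `default` on a miss."""
    index = {d[key]: d for d in secondary}
    out = []
    for d in primary:
        match = index.get(d[lookup_field])
        row = dict(d)
        for f in fields:
            row[f] = match[f] if match is not None else default
        out.append(row)
    return out

# toy data; "Zoe" has no match in the secondary table
groups = [{"group": 1, "person": "Alan"}, {"group": 2, "person": "Zoe"}]
people = [{"name": "Alan", "age": 25, "height": 180}]
joined = lookup(groups, people, "person", "name", ["age", "height"])
```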

An example:

The input tables:

lookup_groups.csv:

| group | person |
| --- | --- |
| 1 | Alan |
| 1 | George |
| 1 | Fred |
| 2 | Steve |
| 2 | Nick |
| 2 | Will |
| 3 | Cole |
| 3 | Rick |
| 3 | Tom |

lookup_people.csv:

| name | age | height |
| --- | --- | --- |
| Alan | 25 | 180 |
| George | 32 | 174 |
| Fred | 39 | 182 |
| Steve | 42 | 161 |
| Nick | 23 | 180 |
| Will | 21 | 168 |
| Cole | 51 | 160 |
| Rick | 63 | 181 |
| Tom | 54 | 179 |

Merge the two tables on the person's name:

{
  "data": {"url": "/assets/data/lookup_groups.csv"},
  "transform": [{
    "lookup": "person",
    "from": {
      "data": {"url": "/assets/data/lookup_people.csv"},
      "key": "name",
      "fields": ["age", "height"]
    }
  }],
  "mark": "bar",
  "encoding": {
    "x": {"field": "group"},
    "y": {"field": "age", "aggregate": "mean"}
  }
}

Advanced usage:

lookup can also take the name of a selection param as its data source. The following example uses lookup to build a slick interaction:

{
  "data": {
    "url": "/assets/data/stocks.csv",
    "format": {"parse": {"date": "date"}}
  },
  "width": 650,
  "height": 300,
  "layer": [
    {
      //the interaction is defined here
      "params": [{
        "name": "index",
        "value": [{"x": {"year": 2005, "month": 1, "date": 1}}],
        "select": {
          "type": "point",
          "encodings": ["x"],
          "on": "mouseover",
          "nearest": true
        }
      }],
      "mark": "point",
      "encoding": {
        "x": {"field": "date", "type": "temporal", "axis": null},
        "opacity": {"value": 0}
      }
    },
    {
      "transform": [
        {
          "lookup": "symbol",
          //here from is set to the interaction's param
          "from": {"param": "index", "key": "symbol"}
        },
        {
          //the derived table attaches the matched index entry to each row of the original stocks table, so the index value can be used in calculations alongside the row's own values
          "calculate": "datum.index && datum.index.price > 0 ? (datum.price - datum.index.price)/datum.index.price : 0",
          "as": "indexed_price"
        }
      ],
      "mark": "line",
      "encoding": {
        "x": {"field": "date", "type": "temporal", "axis": null},
        "y": {
          "field": "indexed_price", "type": "quantitative",
          "axis": {"format": "%"}
        },
        "color": {"field": "symbol", "type": "nominal"}
      }
    },
    {
      "transform": [{"filter": {"param": "index"}}],
      "encoding": {
        "x": {"field": "date", "type": "temporal", "axis": null},
        "color": {"value": "firebrick"}
      },
      "layer": [
        {"mark": {"type": "rule", "strokeWidth": 0.5}},
        {
          "mark": {"type": "text", "align": "center", "fontWeight": 100},
          "encoding": {
            "text": {"field": "date", "timeUnit": "yearmonth"},
            "y": {"value": 310}
          }
        }
      ]
    }
  ]
}

Pivot

Long to wide — the inverse of fold

| Property | Type | Description |
| --- | --- | --- |
| pivot | String | The field to pivot on; its unique values become the new column names |
| value | String | The field whose values populate the new columns |
| groupby | String[] | The fields to group by |
| limit | Number | The maximum number of generated columns; the default 0 means no limit |
| op | String | The aggregation applied to grouped values; defaults to sum |

Example:

[
  {"country": "Norway", "type": "gold", "count": 14},
  {"country": "Norway", "type": "silver", "count": 14},
  {"country": "Norway", "type": "bronze", "count": 11},
  {"country": "Germany", "type": "gold", "count": 14},
  {"country": "Germany", "type": "silver", "count": 10},
  {"country": "Germany", "type": "bronze", "count": 7},
  {"country": "Canada", "type": "gold", "count": 11},
  {"country": "Canada", "type": "silver", "count": 8},
  {"country": "Canada", "type": "bronze", "count": 10}
]

//apply the following pivot:
{
  "pivot": "type",
  "groupby": ["country"],
  "value": "count"
}

//result:
[
  {"country": "Norway", "gold": 14, "silver": 14, "bronze": 11},
  {"country": "Germany", "gold": 14, "silver": 10, "bronze": 7},
  {"country": "Canada", "gold": 11, "silver": 8, "bronze": 10}
]
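
A Python sketch of the same pivot, with the default op (sum):

```python
from collections import defaultdict

def pivot(data, pivot_field, value_field, groupby):
    """Sketch of the pivot transform with the default op (sum): the values of
    pivot_field become column names, filled from value_field."""
    groups = defaultdict(lambda: defaultdict(int))
    order = []
    for d in data:
        key = tuple(d[g] for g in groupby)
        if key not in groups:
            order.append(key)
        groups[key][d[pivot_field]] += d[value_field]
    return [{**dict(zip(groupby, key)), **groups[key]} for key in order]

medals = [
    {"country": "Norway", "type": "gold", "count": 14},
    {"country": "Norway", "type": "silver", "count": 14},
    {"country": "Norway", "type": "bronze", "count": 11},
    {"country": "Canada", "type": "gold", "count": 11},
]
wide = pivot(medals, "type", "count", ["country"])
```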

Quantile

Computes quantiles.

| Property | Type | Description |
| --- | --- | --- |
| quantile | String | The field to compute quantiles for |
| groupby | String[] | The fields to group by |
| probs | Number[] | A list of probabilities in (0, 1); if omitted, step is used instead |
| step | Number | The probability step size (default 0.01); only used when probs is omitted |
| as | String[] | The output field names; default ["prob", "value"] |

{"quantile": "measure", "probs": [0.25, 0.5, 0.75]}
//output
[
  {prob: 0.25, value: 1.34},
  {prob: 0.5, value: 5.82},
  {prob: 0.75, value: 9.31}
];
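
A Python sketch of the quantile computation; linear interpolation between order statistics is assumed here (the scheme d3.quantile uses), which may differ in edge cases from the actual implementation:

```python
def quantile_transform(values, probs, as_=("prob", "value")):
    """Sketch of the quantile transform: linear interpolation between order
    statistics, emitting one {prob, value} row per requested probability."""
    xs = sorted(values)
    out = []
    for p in probs:
        h = (len(xs) - 1) * p       # fractional position in the sorted data
        lo = int(h)
        hi = min(lo + 1, len(xs) - 1)
        out.append({as_[0]: p, as_[1]: xs[lo] + (xs[hi] - xs[lo]) * (h - lo)})
    return out
```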

Example: generating a Q-Q plot

{
  "data": {
    "url": "/assets/data/normal-2d.json"
  },
  "transform": [
    {
      "quantile": "u",
      "step": 0.01,
      "as": [
        "p",
        "v"
      ]
    },
    {
      "calculate": "quantileUniform(datum.p)",
      "as": "unif"
    },
    {
      "calculate": "quantileNormal(datum.p)",
      "as": "norm"
    }
  ],
  "hconcat": [
    {
      "mark": "point",
      "encoding": {
        "x": {
          "field": "unif",
          "type": "quantitative"
        },
        "y": {
          "field": "v",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": "point",
      "encoding": {
        "x": {
          "field": "norm",
          "type": "quantitative"
        },
        "y": {
          "field": "v",
          "type": "quantitative"
        }
      }
    }
  ]
}

Regression

Supported regression models:

  • linear: \( y = a + bx \)

  • log: logarithmic, \( y = a + b \log(x) \)

  • exp: exponential, \( y = a e^{bx} \)

  • pow: power, \( y = a x^b \)

  • quad: quadratic, \( y = a + bx + cx^2 \)

  • poly: polynomial, \( y = a + bx + \dots + kx^{order} \)
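
The linear case is plain ordinary least squares; a Python sketch of what params: true exposes (the coefficients and rSquared):

```python
from statistics import mean

def linear_fit(points):
    """Ordinary least squares for method="linear" (y = a + b*x), returning
    (a, b, r_squared) — a sketch of what params=true exposes as coef/rSquared."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in points)
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot
```

On points lying exactly on y = 1 + 2x this recovers a = 1, b = 2 with rSquared = 1.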

| Property | Type | Description |
| --- | --- | --- |
| regression | String | The dependent variable field |
| on | String | The independent variable field |
| groupby | String[] | The fields to group by |
| method | String | One of the models above; defaults to linear |
| order | Number | The polynomial order for the poly method; defaults to 3 |
| extent | Number[] | The [min, max] domain over which to compute the trend line |
| params | Boolean | If true, returns the fitted model parameters (a coef array and rSquared) instead of trend-line points |
| as | String[] | The output field names; defaults to the names of the x and y fields |

An example:

{
  "data": {
    "url": "/assets/data/movies.json"
  },
  "layer": [
    {
      "mark": {
        "type": "point",
        "filled": true
      },
      "encoding": {
        "x": {
          "field": "Rotten Tomatoes Rating",
          "type": "quantitative"
        },
        "y": {
          "field": "IMDB Rating",
          "type": "quantitative"
        }
      }
    },
    {
      "mark": {
        "type": "line",
        "color": "firebrick"
      },
      "transform": [
        {
          "regression": "IMDB Rating",
          "on": "Rotten Tomatoes Rating"
        }
      ],
      "encoding": {
        "x": {
          "field": "Rotten Tomatoes Rating",
          "type": "quantitative"
        },
        "y": {
          "field": "IMDB Rating",
          "type": "quantitative"
        }
      }
    },
    {
      "transform": [
        {
          "regression": "IMDB Rating",
          "on": "Rotten Tomatoes Rating",
          "params": true
        },
        {"calculate": "'R²: '+format(datum.rSquared, '.2f')", "as": "R2"}
      ],
      "mark": {
        "type": "text",
        "color": "firebrick",
        "x": "width",
        "align": "right",
        "y": -5
      },
      "encoding": {
        "text": {"type": "nominal", "field": "R2"}
      }
    }
  ]
}

Sample

Random sampling. A single parameter, sample, sets the sample size: {"sample": 500}

// Any View Specification
{
  ...
  "transform": [
    {"sample": 500} // Sample Transform
     ...
  ],
  ...
}

Stack

stack produces stacked charts; it can be used in encoding or in transform.

  • Only applies to continuous x, y, theta, and radius channels

  • zero / true: stacking from a zero baseline (like ggplot's position="stack") — the basic stacked bar chart

  • normalize: normalized stacking (like ggplot's position="fill"); also used to draw pie charts

  • center: stacking centered on the middle; used to produce streamgraphs

  • null / false: no stacking — the groups overlap one another (like ggplot's position="identity")
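
A Python sketch of the zero-baseline case at a single x position:

```python
def stack_zero(groups):
    """Sketch of zero-baseline stacking at one x position: each (group, value)
    pair gets a [y0, y1) span sitting on top of the previous one."""
    out, y0 = [], 0
    for name, value in groups:
        out.append({"group": name, "y0": y0, "y1": y0 + value})
        y0 += value
    return out
```

normalize would additionally divide every span by the running total, and center would shift all spans so the stack straddles the midline.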

There are plenty of examples — not reproduced here, see the docs

An advanced example: instead of using stack, compute signed values to achieve a diverging (two-direction) stacked chart:

{
  "data": { "url": "/assets/data/population.json"},
  "transform": [
    {"filter": "datum.year == 2000"},
    {"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"},
    {"calculate": "datum.sex == 2 ? -datum.people : datum.people", "as": "signed_people"}
  ],
  "width": 500,
  "height": 300,
  "mark": "bar",
  "encoding": {
    "y": {
      "field": "age",
      "axis": null, "sort": "descending"
    },
    "x": {
      "aggregate": "sum", "field": "signed_people",
      "title": "population",
      "axis": {"format": "s"}
    },
    "color": {
      "field": "gender",
      "scale": {"range": ["#675193", "#ca8861"]},
      "legend": {"orient": "top", "title": null}
    }
  },
  "config": {
    "view": {"stroke": null},
    "axis": {"grid": false}
  }
}

Another advanced usage: when overlaying lines on a stacked area chart, the line's stacking offset must be declared explicitly:

{
  "data": {"url": "/assets/data/population.json"},
  "transform": [
    {"filter": "datum.year == 2000"},
    {"calculate": "datum.sex == 2 ? 'Female' : 'Male'", "as": "gender"}
  ],
  "layer": [
    {
      "mark": {"opacity": 0.7, "type": "area"},
      "encoding": {
        "y": {"aggregate": "sum", "field": "people", "type": "quantitative"},
        "x": {"field": "age", "type": "nominal"},
        "color": {
          "field": "gender",
          "scale": {"range": ["#675193", "#ca8861"]},
          "type": "nominal"
        },
        "opacity": {"value": 0.7}
      }
    },
    {
      "mark": {"type": "line"},
      "encoding": {
        "y": {
          "aggregate": "sum",
          "field": "people",
          "type": "quantitative",
          "stack": "zero"
        },
        "x": {"field": "age", "type": "nominal"},
        "color": {
          "field": "gender",
          "scale": {"range": ["#675193", "#ca8861"]},
          "type": "nominal"
        },
        "opacity": {"value": 0.7}
      }
    }
  ]
}

In transform, stack supports these properties: stack, groupby, offset, sort, as. sort can be used to order the stacked segments, similar to ordering by factor levels in ggplot2.

Two more advanced uses that I rarely need; code omitted — look them up if needed:

Custom offsets:

Mosaic plots:

Time Unit

I rarely work with time-series data, so skipping this for now. Original docs: here

Window

Performs calculations over sorted groups of data objects (e.g. ranking, lead/lag, aggregates) and writes the results back onto the output stream

Transform Parameters

| Property | Type | Description |
| --- | --- | --- |
| sort | Compare[] | Defines the sort order of the data within each window |
| groupby | Field[] | The fields to partition by |
| ops | String[] | The operations to apply, e.g. rank, lead, sum; see the window operation reference table |
| fields | Field[] | The fields to operate on; this array is aligned with ops, as and params |
| params | Array | Parameter values for the window functions; aligned with ops |
| as | String[] | Output field names for the ops; auto-generated from the operation if omitted |
| frame | Number[] | A two-element array configuring the sliding window: [-5, 5] means the current object plus the five before and the five after it; the default [null, 0] means the current object and all preceding ones (null = unbounded) |
| ignorePeers | Boolean | Whether the window ignores peer values (objects that tie under sort); defaults to false |

Window operation reference

Valid operations inside window include all of the aggregate operations, plus the following:

| Operation | Parameter | Description |
| --- | --- | --- |
| row_number | None | Assigns row numbers starting at 1 |
| rank | None | Rank starting at 1; ties share a rank and later ranks skip ahead, e.g. 1, 1, 3, 3, 5 |
| dense_rank | None | Rank starting at 1; ties share a rank with no gaps, e.g. 1, 1, 2, 2, 3 |
| percent_rank | None | Percentile rank, computed as \( (rank - 1) / (groupSize - 1) \) |
| cume_dist | None | The cumulative distribution value, between 0 and 1 |
| ntile | Number | The n-tile bucket; the parameter is the number of buckets (e.g. 100 for percentiles, 5 for quintiles) |
| lag | Number | The value at the given offset before the current object, or null if none exists; the offset defaults to 1 |
| lead | Number | The value at the given offset after the current object |
| first_value | None | The first value in the current frame |
| last_value | None | The last value in the current frame |
| nth_value | Number | The nth value in the current frame |
| prev_value | None | The nearest preceding non-missing value (including the current object) in the sorted frame |
| next_value | None | The nearest following non-missing value (including the current object) in the sorted frame |
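
A toy Python sketch of window semantics for a single group, showing rank and a running sum under the default frame [null, 0] (the field name price is just for illustration):

```python
def window(data, sort_field, ops):
    """Toy sketch of the window transform for one group: sort the rows, then
    apply each op — a function of (sorted_rows, index) — to every row."""
    rows = sorted(data, key=lambda d: d[sort_field])
    out = []
    for i, d in enumerate(rows):
        row = dict(d)
        for as_name, fn in ops.items():
            row[as_name] = fn(rows, i)
        out.append(row)
    return out

def rank(rows, i):
    # ties share a rank; later ranks skip ahead (1, 1, 3, ...)
    return 1 + sum(1 for r in rows if r["price"] < rows[i]["price"])

def cume_sum(rows, i):
    # the default frame [null, 0]: the current object plus all preceding ones
    return sum(r["price"] for r in rows[: i + 1])

stocks = [{"price": 5}, {"price": 3}, {"price": 3}]
out = window(stocks, "price", {"rank": rank, "cume_sum": cume_sum})
```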

Example

Wordcloud

Example

Transform parameters

| Property | Type | Description |
| --- | --- | --- |
| font | String or Expr | The font family |
| fontStyle | String or Expr | The font style |
| fontWeight | String or Expr | The font weight |
| fontSize | Number or Expr | The font size |
| fontSizeRange | Number[] | The font-size range; if a range is given without fontSize, sizes are scaled within the range on a square-root scale |
| padding | Number or Expr | The padding around each word |
| rotate | Number or Expr | The rotation angle, in degrees |
| text | Field | The text field |
| size | Number[] | The layout size, as [width, height] |
| spiral | String | The layout method: archimedean (the default) or rectangular |
| as | String[] | The output fields; default ["x", "y", "font", "fontSize", "fontStyle", "fontWeight", "angle"] |

Note: wordcloud requires the text mark to have align: "center" and baseline: "alphabetic"; otherwise text positioning will be inaccurate.